Introduction

Many region-scale analyses in archaeology begin with a simple question: How do site locations relate to landscape attributes, such as elevation, soil type, or distance to water or other resources? Such a question is the foundation of basic geospatial exploratory data analysis, and answering it for a set of landscape attributes is the first step towards doing more interesting things, from interpreting settlement patterns in the past (Kowalewski 2008), to the construction of sophisticated predictive models of site location (Graves McEwan 2012; Kohler and Parker 1986; Maschner and Stein 1995), to guiding settlement decision frameworks in agent-based simulations (e.g., Axtell et al. 2002; Griffin and Stanish 2007; Kohler and Varien 2012). In this tutorial, we will learn how to use R to load, view, and explore site location data, and perform a very basic statistical settlement pattern analysis relating site location to elevation.

Of course, archaeological site locations are often sensitive information, and it wouldn’t be prudent to provide them in a tutorial like this. So instead of using cultural site locations, we’ll use a point dataset for which we can make a reasonable hypothesis concerning landscape attributes: cell tower locations from the US Federal Communications Commission. The cell tower data are somewhat difficult to work with, so I’ve distilled a snapshot of the database (accessed on February 14, 2017) and posted it online for easy download. We’ll go through the process of downloading it later in this tutorial. The hypothesis we’ll be testing is that cell towers are positioned in unusually high places on the landscape. This is similar to hypotheses we might make about archaeological site locations, perhaps having to do with defensibility (e.g., Bocinsky 2014; Martindale and Supernant 2009; Sakaguchi, Morin, and Dickie 2010) or intervisibility and signaling (e.g., Johnson 2003; Van Dyke et al. 2016).

This tutorial is an R Markdown HTML document, meaning that all of the code to perform the calculations presented here was run when this web page was built—the paper was compiled. “Executable papers” such as this one are fantastic for presenting reproducible research in such a way that data, analysis, and interpretation are each given equal importance. Feel free to use this analysis as a template for your own work. All data and code for performing this analysis are available on Github at https://github.com/bocinsky/r_tutorials.

Learning goals

In this short tutorial, you will learn how to:

- Define a study area using US Census county boundary data
- Read “site” locations from a table and crop them to a study area
- Visualize site locations on an interactive leaflet map
- Download landscape (elevation) data for a study area with the FedData package
- Test whether sites are situated with respect to a landscape attribute

Defining the study area

All landscape-scale analyses start with the definition of a study area. Since the cell tower dataset with which we’ll be working covers the whole USA, we could really set our study area to be anywhere. Here, we will download an ESRI shapefile of counties in the United States available from the US Census, and pick a county to serve as our study area. I’m going to pick Whitman county, Washington, because that’s where I live; feel free to choose your own county!

Files may be downloaded in R using many different functions, but perhaps the most straightforward is the download.file() function, which requires that you specify a url to the file you wish to download, and a destfile where the downloaded file should end up. As the counties shapefile is in a zip archive, we will also use the unzip() function, which requires you define an extraction directory (exdir).

# Download the 1:500000 scale counties shapefile from the US Census
# (create the destination directory first, in case it doesn't exist)
dir.create("OUTPUT", showWarnings = FALSE)
download.file("http://www2.census.gov/geo/tiger/GENZ2015/shp/cb_2015_us_county_500k.zip",
              destfile = "OUTPUT/cb_2015_us_county_500k.zip")

# Unzip the file
unzip("OUTPUT/cb_2015_us_county_500k.zip",
      exdir = "OUTPUT/counties")

Navigate to the exdir you specified and check to make sure the shapefile is there.

Now it’s time to load the shapefile into R. We’ll be using the readOGR() function from the rgdal library, which reads a shapefile (and many other file formats) and stores it in memory as a Spatial* object of the sp library. The readOGR() function requires two parameters: a dsn (data source name) which is the directory where the shapefile is stored, and a layer which is the shapefile name (without the file extension). Other spatial file formats have different requirements for dsn and layer, so read the documentation (help(readOGR)) for more information.

# Load the shapefile
census_counties <- rgdal::readOGR(dsn = "OUTPUT/counties",
                          layer = "cb_2015_us_county_500k")
## OGR data source with driver: ESRI Shapefile 
## Source: "OUTPUT/counties", layer: "cb_2015_us_county_500k"
## with 3233 features
## It has 9 fields
## Integer64 fields read as strings:  ALAND AWATER
# Inspect the spatial object
census_counties
## class       : SpatialPolygonsDataFrame 
## features    : 3233 
## extent      : -179.1489, 179.7785, -14.5487, 71.36516  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0 
## variables   : 9
## names       : STATEFP, COUNTYFP, COUNTYNS,       AFFGEOID, GEOID,    NAME, LSAD,      ALAND,   AWATER 
## min values  :      01,      001, 00023901, 0500000US01001, 01001, Añasco,   00, 1000508841,        0 
## max values  :      78,      840, 02516404, 0500000US78030, 78030, Ziebach,   25, 9985023570, 99900775

When we inspect the census_counties object, we see that it is a SpatialPolygonsDataFrame object with 3233 features. SpatialPolygonsDataFrame objects have an associated data table, in which we can see many fields including one called “NAME”.

Now it’s time to extract just the county we want to define our study area. Because a SpatialPolygonsDataFrame object extends the data.frame class, we can perform selection just as we would with a data.frame. We do that here:

# Select Whitman county
my_county <- census_counties[census_counties$NAME == "Whitman",]

# Inspect the spatial object
my_county
## class       : SpatialPolygonsDataFrame 
## features    : 1 
## extent      : -118.2492, -117.0394, 46.4171, 47.26045  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0 
## variables   : 9
## names       : STATEFP, COUNTYFP, COUNTYNS,       AFFGEOID, GEOID,    NAME, LSAD,      ALAND,   AWATER 
## min values  :      53,      075, 01533501, 0500000US53075, 53075, Whitman,   06, 5591999422, 47939386 
## max values  :      53,      075, 01533501, 0500000US53075, 53075, Whitman,   06, 5591999422, 47939386
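As it happens, “Whitman” is a unique county name nationwide, but many county names are not (there are dozens of Lincoln and Washington counties). A toy data.frame with made-up rows illustrates the same selection idiom, and why filtering on the state FIPS code as well is often safer:

```r
# A hypothetical miniature version of the county attribute table
counties <- data.frame(NAME    = c("Whitman", "Lincoln", "Lincoln"),
                       STATEFP = c("53", "53", "16"),
                       stringsAsFactors = FALSE)

# Filtering on NAME alone can match several rows...
nrow(counties[counties$NAME == "Lincoln", ])                             # 2

# ...so adding the state FIPS code guarantees a single county
nrow(counties[counties$NAME == "Lincoln" & counties$STATEFP == "53", ])  # 1
```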

As you can see, the spatial object now has only one feature, and it is Whitman county! We’ll map it in a minute, but first let’s do two more things to make our lives easier down the road. We’ll be mapping using the leaflet package, which makes pretty, interactive web maps. For leaflet, we need the spatial data to be in geographic coordinates (longitude/latitude) using the WGS84 ellipsoid. Here, we’ll transform our county to that coordinate system using the spTransform() function, then get the rectangular extent of our county using a function called polygon_from_extent() available in the FedData package (more on that package later).

This code chunk also uses something new: the pipe operator %>% from the magrittr package. The pipe operator enables you to “pipe” a value forward into an expression or function call: whatever is on the left-hand side of the pipe becomes the first argument to the function on the right-hand side. So, for example, instead of finding the mean of the numeric vector c(1,2,3,5) by typing mean(c(1,2,3,5)), we could use the pipe: c(1,2,3,5) %>% mean(). Try running both versions; you should get 2.75 each time. The pipe isn’t much use for such a simple example, but it becomes really helpful for code readability when chaining together many different functions. The compound assignment operator %<>% pipes an object forward into a function or call expression and then updates the left-hand side object with the resulting value; it is equivalent to x <- x %>% fun().
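A minimal, self-contained demonstration of both operators:

```r
library(magrittr)

# The pipe inserts the left-hand side as the first argument:
mean(c(1, 2, 3, 5))       # 2.75
c(1, 2, 3, 5) %>% mean()  # 2.75

# The compound assignment pipe %<>% updates the object in place:
x <- c(1, 2, 3, 5)
x %<>% mean()             # equivalent to x <- x %>% mean()
x                         # 2.75
```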

# Transform to geographic coordinates
my_county %<>%
  sp::spTransform("+proj=longlat")

# Get a polygon of the rectangular extent of Whitman county. This is our study area.
my_county_extent <- my_county %>%
  FedData::polygon_from_extent()

Reading “site” locations from a table and cropping to a study area

Alright, now that we’ve got our study area defined, we can load our “site” data (the cell towers). We can use the read_csv() function from the readr library to read the cell tower locations, either from a local copy or straight from where I’ve archived them online (the commented-out call below shows the URL version). We’ll read them in, and then print them.

# Load cell tower location data 
cell_towers <- readr::read_csv("data/cell_towers.csv")
## Parsed with column specification:
## cols(
##   `Unique System Identifier` = col_integer(),
##   `Entity Name` = col_character(),
##   `Height of Structure` = col_double(),
##   Latitude = col_double(),
##   Longitude = col_double()
## )
# Load cell tower location data straight from the internet using a URL path
# cell_towers <- readr::read_csv("https://raw.githubusercontent.com/bocinsky/r_tutorials/master/data/cell_towers.csv")

cell_towers
## # A tibble: 101,388 × 5
##    `Unique System Identifier`                      `Entity Name`
##                         <int>                              <chr>
## 1                       96974        Crown Castle GT Company LLC
## 2                       96978             SBC TOWER HOLDINGS LLC
## 3                       96979       CHAMPAIGN TOWER HOLDINGS LLC
## 4                       96981                          CCATT LLC
## 5                       96989         Crown Atlantic Company LLC
## 6                       96999             SBC TOWER HOLDINGS LLC
## 7                       97007               American Towers, LLC
## 8                       97011 UNITED STATES CELLULAR CORPORATION
## 9                       97014             Crown Castle South LLC
## 10                      97022        Crown Castle GT Company LLC
## # ... with 101,378 more rows, and 3 more variables: `Height of
## #   Structure` <dbl>, Latitude <dbl>, Longitude <dbl>

As you can see, the cell tower data include basic identification information as well as geographic coordinates in longitude and latitude. We can use the coordinate data to promote the data frame to a spatial object using the coordinates() and proj4string() functions from the sp package. Finally, we can use the crop() function from the raster package to select only the cell towers in our study area.

# Create a SpatialPointsDataFrame by adding coordinates
sp::coordinates(cell_towers) <- ~Longitude+Latitude

# And set the projection information
sp::proj4string(cell_towers) <- "+proj=longlat"

# Select cell towers in our study area
cell_towers %<>%
  raster::crop(my_county_extent)

cell_towers
## class       : SpatialPointsDataFrame 
## features    : 45 
## extent      : -118.2042, -117.0461, 46.42242, 47.20611  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=WGS84 
## variables   : 3
## names       : Unique System Identifier,                    Entity Name, Height of Structure 
## min values  :                   597094, Amateur Radio Club Rho Epsilon,                 9.4 
## max values  :                  2692687,             Whitman, County Of,               148.4

Now we see that there are 45 cell towers in our study area.

Visualizing site locations

There are many different ways to visualize spatial data in R, but perhaps the most useful is the leaflet package, which allows you to make interactive HTML maps that will impress your friends, are intuitive to your 4-year-old niece, and will thoroughly confuse your advisor. I’m not going to go through all of the syntax here, but in general leaflet allows you to layer spatial objects over open-source map tiles to create pretty, interactive maps. Here, we’ll initialize a map, add several external base layer tiles, and then overlay our county extent and cell tower objects. leaflet is provided by the good folks at RStudio, and is well documented on the RStudio leaflet site. Zoom in on the satellite view, and you can see the cell towers!

# Create a quick plot of the locations
leaflet(width = "100%") %>% # This line initializes the leaflet map, and sets the width of the map at 100% of the window
  addProviderTiles("OpenTopoMap", group = "Topo") %>% # This line adds the OpenTopoMap topographic tiles
  addProviderTiles("OpenStreetMap.BlackAndWhite", group = "OpenStreetMap") %>% # This line adds the OpenStreetMap tiles
  addProviderTiles("Esri.WorldImagery", group = "Satellite") %>% # This line adds orthoimagery from ESRI
  addProviderTiles("Stamen.TonerLines", # This line and the next adds roads and labels to the orthoimagery layer
                   group = "Satellite") %>%
  addProviderTiles("Stamen.TonerLabels",
                   group = "Satellite") %>%
  addPolygons(data = my_county_extent, # This line adds the Whitman county extent polygon
              label = "My County Extent",
              fill = FALSE,
              color = "black") %>%
  addPolygons(data = my_county, # This line adds the Whitman county polygon
              label = "My County",
              fill = FALSE,
              color = "red") %>%
  addMarkers(data = cell_towers,
             popup = ~htmlEscape(`Entity Name`)) %>% # This line adds cell tower locations
  addLayersControl( # This line adds a controller for the background layers
    baseGroups = c("Topo", "OpenStreetMap", "Satellite"),
    options = layersControlOptions(collapsed = FALSE),
    position = "topleft")

It’s obvious from this map that the cell towers aren’t merely situated to be in high places—they also tend to cluster along roads and in densely populated areas.

Downloading landscape data with FedData

FedData is an R package that is designed to take much of the pain out of downloading and preparing data from federated geospatial databases. For an area of interest (AOI) that you specify, each function in FedData will download the requisite raw data, crop the data to your AOI, and mosaic the data, including merging any tabular data. Currently, FedData has functions to download and prepare several datasets, including the National Elevation Dataset (NED), the National Hydrography Dataset (NHD), the NRCS SSURGO soils database, the Global Historical Climatology Network (GHCN) daily weather data, the Daymet gridded weather dataset, and the International Tree Ring Data Bank (ITRDB).

In this analysis, we’ll be downloading the 1 arc-second elevation data from the NED over our study area. The FedData functions each require four basic parameters: a template defining the AOI, a label identifying the extraction, a raw.dir in which to store the downloaded raw data, and an extraction.dir in which to store the data cropped to the template.

Here, we’ll download the 1 arc-second NED with the get_ned() function from FedData, using the my_county SpatialPolygonsDataFrame object that we created above as our template, and local relative paths for our raw.dir and extraction.dir. We’ll download and prepare the NED, and then plot it using the basic plot() function.

# Download the 1 arc-second NED elevation model for our study area
my_county_NED <- FedData::get_ned(template = my_county,
                                label = "my_county",
                                raw.dir = "OUTPUT/RAW/NED",
                                extraction.dir = "OUTPUT/EXTRACTIONS/NED")

# Print the my_county_NED object
my_county_NED
## class       : RasterLayer 
## dimensions  : 3037, 4357, 13232209  (nrow, ncol, ncell)
## resolution  : 0.0002777778, 0.0002777778  (x, y)
## extent      : -118.2494, -117.0392, 46.41694, 47.26056  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs 
## data source : E:\My Documents\My Papers\conferences\SAA2017\SAA2017 R forum\Bocinsky\OUTPUT\EXTRACTIONS\NED\my_county_NED_1.tif 
## names       : my_county_NED_1
# Plot the my_county_NED object
my_county_NED %>%
  plot()

# Plot the my_county polygon over the elevation raster
my_county %>%
  plot(add = T)

The NED elevation data were downloaded for our study area and cropped to the rectangular extent of the county.

Are sites situated based on elevation?

Extract site elevations

First, we extract the elevation of the NED cell beneath each tower with the extract() function from the raster package. Note that raster reprojects the points to the coordinate reference system of the raster on the fly, which produces the warning below.

# Extract cell tower elevations from the study area NED values
cell_towers$`Elevation (m)` <- my_county_NED %>%
  raster::extract(cell_towers)
## Warning in .local(x, y, ...): Transforming SpatialPoints to the CRS of the
## Raster
cell_towers@data
## # A tibble: 45 × 4
##    `Unique System Identifier`                   `Entity Name`
##                         <int>                           <chr>
## 1                      597094               Qwest Corporation
## 2                      597409     NORTHWEST LEASE ONE, L.L.C.
## 3                      597902 SINCLAIR LEWISTON LICENSEE, LLC
## 4                      598225             Inland Cellular LLC
## 5                      598624              AVISTA CORPORATION
## 6                      598630              AVISTA CORPORATION
## 7                      599225     WASHINGTON STATE UNIVERSITY
## 8                      609045          UNION PACIFIC RAILROAD
## 9                      609046          UNION PACIFIC RAILROAD
## 10                     612883            PALOUSE COUNTRY INC.
## # ... with 35 more rows, and 2 more variables: `Height of
## #   Structure` <dbl>, `Elevation (m)` <dbl>

Plot kernel density curves for sites and landscape

To compare the towers to the landscape, we estimate a kernel density curve of elevation for the cell towers, and then for each of 999 random samples of the same size drawn from the NED; the 0.025 and 0.975 quantiles of the random samples give a 95% confidence envelope for the landscape as a whole.

# Compute a kernel density estimate of the cell tower elevations
cell_towers_densities <- cell_towers$`Elevation (m)` %>%
  density(from = 150,
          to = 1250,
          n = 1101) %>%
  tidy() %>%
  tibble::as_tibble() %>%
  dplyr::mutate(y = y * 1101) %>%
  dplyr::rename(Elevation = x,
                Frequency = y)


# Load the NED elevations into memory for fast bootstrapping
my_county_NED_values <- my_county_NED %>%
  values()
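A brief aside on density(): it evaluates the kernel density estimate on a regular grid of n points, and the resulting y values integrate to approximately 1, which is why the chunks here rescale y before treating the curves as comparable frequencies. A self-contained illustration using simulated values (unrelated to the NED data):

```r
set.seed(1)
v <- rnorm(100)

# Kernel density estimate evaluated on a regular grid of 512 points
d <- density(v, n = 512)

# The Riemann sum of the density over the grid is approximately 1
sum(d$y) * diff(d$x[1:2])
```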

# Draw 999 random samples, and calculate their densities
# Draw 999 random samples, and calculate their densities
my_county_NED_densities <- foreach(n = 1:999, .combine = rbind) %do% {
  my_county_NED_values %>%
    sample(length(cell_towers),
           replace = FALSE) %>%
    density(from = 150,
            to = 1250,
            n = 1101) %>%
    tidy() %>%
    tibble::as_tibble() %>%
    dplyr::mutate(y = y * 1101)
} %>%
  group_by(x) %>%
  do({
    quantile(.$y, probs = c(0.025, 0.5, 0.975)) %>%
      t() %>%
      tidy()
  }) %>%
  set_names(c("Elevation", "Lower CI", "Frequency", "Upper CI"))

g <- ggplot() +
  geom_line(data = my_county_NED_densities,
            mapping = aes(x = Elevation,
                          y = Frequency)) +
  geom_ribbon(data = my_county_NED_densities,
              mapping = aes(x = Elevation,
                            ymin = `Lower CI`,
                            ymax = `Upper CI`),
              alpha = 0.3) +
  geom_line(data = cell_towers_densities,
               mapping = aes(x = Elevation,
                             y = Frequency),
               color = "red")

ggplotly(g)

Mann-Whitney U test comparing non-normal distributions
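Before bootstrapping, it may help to see what a single one-sided, two-sample Wilcoxon test looks like. Here is a sketch on simulated data (these elevations are made up for illustration, not drawn from the NED):

```r
set.seed(42)
towers    <- rnorm(45, mean = 700, sd = 50)  # hypothetical "site" elevations
landscape <- rnorm(45, mean = 600, sd = 50)  # hypothetical background sample

# Are the tower elevations stochastically greater than the landscape's?
wilcox.test(x = towers,
            y = landscape,
            alternative = "greater",
            exact = FALSE)
```

A small p-value indicates the first sample tends to be larger. Repeating the test over many random landscape samples, as below, guards against the results hinging on a single unlucky draw.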

# Draw 999 random samples from the NED, and compute two-sample Wilcoxon tests (Mann-Whitney U tests)
my_county_Cell_MWU <- foreach(n = 1:999, .combine = rbind) %do% {
  my_county_sample <- my_county_NED_values %>%
    sample(length(cell_towers),
           replace = FALSE) %>%
    wilcox.test(x = cell_towers$`Elevation (m)`,
                y = .,
                alternative = "greater",
                exact = FALSE) %>%
    tidy() %>%
    tibble::as_tibble()
  
} %>%
  dplyr::select(statistic, p.value)

my_county_Cell_MWU <- foreach::foreach(prob = c(0.025,0.5,0.975), .combine = rbind) %do% {
  my_county_Cell_MWU %>%
      dplyr::summarise_all(quantile, probs = prob)
} %>%
  t() %>%
  magrittr::set_colnames(c("Lower CI","Median","Upper CI")) %>%
  magrittr::set_rownames(c("U statistic","p-value"))

# The "tee" pipe %T>% passes the table to write.csv() for its side effect,
# then returns the table itself, so it is both saved and printed
my_county_Cell_MWU %T>%
  write.csv("OUTPUT/Mann_Whitney_results.csv") # Write output table as a CSV
##                 Lower CI       Median     Upper CI
## U statistic 1.499900e+03 1.662000e+03 1.791000e+03
## p-value     1.712279e-10 8.149979e-08 4.262999e-05

The results of the bootstrapped Mann-Whitney U two-sample tests demonstrate that it is highly likely that the cell towers in Whitman county were placed on unusually high places on the landscape (median U statistic = 1662, median p-value = 8.15 × 10^-8).

Conclusions

R is extremely powerful as a geo-analytic tool, and encountering sophisticated code for the first time can be daunting. But it can also be useful for exploratory analysis, interactive data presentation, and fully reproducible research.

Session information

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## locale:
## [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
## [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] FedData_2.4.5   htmltools_0.3.5 leaflet_1.1.0   raster_2.5-8   
##  [5] rgdal_1.2-5     sp_1.2-4        plotly_4.5.6    ggplot2_2.2.1  
##  [9] broom_0.4.2     tidyr_0.6.1     dplyr_0.5.0     tibble_1.2     
## [13] purrr_0.2.2     foreach_1.4.3   magrittr_1.5   
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.2      lattice_0.20-34     colorspace_1.3-2   
##  [4] viridisLite_0.1.3   yaml_2.1.14         base64enc_0.1-3    
##  [7] foreign_0.8-67      DBI_0.5-1           plyr_1.8.4         
## [10] stringr_1.2.0       rgeos_0.3-22        munsell_0.4.3      
## [13] gtable_0.2.0        htmlwidgets_0.8     codetools_0.2-15   
## [16] psych_1.6.12        evaluate_0.10       labeling_0.3       
## [19] knitr_1.15.1        doParallel_1.0.10   httpuv_1.3.3       
## [22] crosstalk_1.0.0     parallel_3.3.2      Rcpp_0.12.9        
## [25] xtable_1.8-2        readr_1.0.0         scales_0.4.1       
## [28] backports_1.0.5     jsonlite_1.3        mime_0.5           
## [31] mnormt_1.5-5        digest_0.6.12       stringi_1.1.2      
## [34] shiny_1.0.0         ncdf4_1.15          grid_3.3.2         
## [37] rprojroot_1.2       tools_3.3.2         lazyeval_0.2.0.9000
## [40] data.table_1.10.4   lubridate_1.6.0     assertthat_0.1     
## [43] rmarkdown_1.3.9004  httr_1.2.1          iterators_1.0.8    
## [46] R6_2.2.0            nlme_3.1-131        compiler_3.3.2

References cited

Axtell, Robert L., Joshua M. Epstein, Jeffrey S. Dean, George J. Gumerman, Alan C. Swedlund, Jason Harburger, Shubha Chakravarty, Ross Hammond, Jon Parker, and Miles Parker. 2002. “Population Growth and Collapse in a Multiagent Model of the Kayenta Anasazi in Long House Valley.” Proceedings of the National Academy of Sciences 99 (suppl 3): 7275–9.

Bocinsky, R. Kyle. 2014. “Extrinsic Site Defensibility and Landscape-Based Archaeological Inference: An Example from the Northwest Coast.” Journal of Anthropological Archaeology 35: 164–76.

Graves McEwan, Dorothy. 2012. “Qualitative Landscape Theories and Archaeological Predictive Modelling—A Journey Through No Man’s Land?” Journal of Archaeological Method and Theory 19: 526–47.

Griffin, Arthur F., and Charles Stanish. 2007. “An Agent-Based Model of Prehistoric Settlement Patterns and Political Consolidation in the Lake Titicaca Basin of Peru and Bolivia.” Structure and Dynamics 2 (2): 1–47 [online].

Johnson, C David. 2003. “Mesa Verde Region Towers: A View from Above.” Kiva 68 (4). JSTOR: 323–40.

Kohler, Timothy A., and Sandra C. Parker. 1986. “Predictive Models for Archaeological Resource Location.” Advances in Archaeological Method and Theory 9: 397–452.

Kohler, Timothy A., and Mark D. Varien, eds. 2012. Emergence and Collapse of Early Villages: Models of Central Mesa Verde Archaeology. Berkeley, California: University of California Press.

Kowalewski, Stephen A. 2008. “Regional Settlement Pattern Studies.” Journal of Archaeological Research 16: 225–85.

Martindale, Andrew, and Kisha Supernant. 2009. “Quantifying the Defensiveness of Defended Sites on the Northwest Coast of North America.” Journal of Anthropological Archaeology 28 (2): 191–204.

Maschner, Herbert D.G., and Jeffrey W. Stein. 1995. “Multivariate Approaches to Site Location on the Northwest Coast of North America.” Antiquity 69 (262): 61–73.

Sakaguchi, Takashi, Jesse Morin, and Ryan Dickie. 2010. “Defensibility of Large Prehistoric Sites in the Mid-Fraser region on the Canadian Plateau.” Journal of Archaeological Science 37 (6): 1171–85.

Van Dyke, Ruth M., R. Kyle Bocinsky, Tucker Robinson, and Thomas C. Windes. 2016. “Great Houses, Shrines, and High Places: Intervisibility in the Chacoan World.” American Antiquity 81 (2): 205–30.